Technical Infrastructure at Linguistic Data Consortium: Software and Hardware Resources for Linguistic Data Creation
نویسندگان
چکیده
Linguistic Data Consortium (LDC) at the University of Pennsylvania has participated as a data provider in a variety of governmentsponsored programs that support development of Human Language Technologies. As the number of projects increases, the quantity and variety of the data LDC produces have increased dramatically in recent years. In this paper, we describe the technical infrastructure, both hardware and software, that LDC has built to support these complex, large-scale linguistic data creation efforts at LDC. As it would not be possible to cover all aspects of LDC’s technical infrastructure in one paper, this paper focuses on recent development. We also report on our plans for making our custom-built software resources available to the community as open source software, and introduce an initiative to collaborate with software developers outside LDC. We hope that our approaches and software resources will be useful to the community members who take on similar challenges.
منابع مشابه
Annotation Tool Development for Large-Scale Corpus Creation Projects at the Linguistic Data Consortium
The Linguistic Data Consortium (LDC) creates a variety of linguistic resources – data, annotations, tools, standards and best practices – for many sponsored projects. The programming staff at LDC has created the tools and technical infrastructures to support the data creation efforts for these projects, creating tools and technical infrastructures for all aspects of data creation projects: data...
متن کاملEnhanced Infrastructure for Creation and Collection of Translation Resources
Statistical Machine Translation (MT) systems have achieved impressive results in recent years, due in large part to the increasing availability of parallel text for system training and development. This paper describes recent efforts at Linguistic Data Consortium to create linguistic resources for MT, including corpora, specifications and resource infrastructure. We review LDC's three-pronged a...
متن کاملLinguistic Resources for Meeting Speech Recognition
This paper describes efforts by the University of Pennsylvania's Linguistic Data Consortium to create and distribute shared linguistic resources – including data, annotations, tools and infrastructure – to support the Rich Transcription 2005 Spring Meeting Recognition Evaluation. In addition to distributing large volumes of training data, LDC produced reference transcripts for the RT-05S confer...
متن کاملA Progress Report from the Linguistic Data Consortium: Recent Activities in Resource Creation and Distribution and the Development of Tools and Standards
This paper described recent activities of the Linguistic Data Consortium in the collection, annotation and distribution of language data the developments of tools and standards for using that data, the creation of metadata to facilitate the search for linguistic resources.
متن کاملLinguistic Resources for the Meeting Domain
This paper describes efforts by the University of Pennsylvania's Linguistic Data Consortium to create and distribute shared linguistic resources – including data, annotations, and tools – to support the Spring 2009 (RT-09) Rich Transcription Meeting Recognition Evaluation. In addition to making available large volumes of training data to research participants, LDC produced reference transcripts...
متن کامل